Introduction

In this project we explore the univariate, bivariate and multivariate relationships between variables in the Red Wine Quality dataset using techniques in R.

About the dataset

This dataset is publicly available for research, courtesy of P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis: Modeling wine preferences by data mining from physicochemical properties.

Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016

Variables description

Input variables (based on physicochemical tests):

  • Fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

  • Volatile acidity: the amount of acetic acid in wine, which at levels that are too high can lead to an unpleasant, vinegar taste

  • Citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

  • Residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

  • Chlorides: the amount of salt in the wine

  • Free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

  • Total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

  • Density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

  • pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

  • Sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

  • Alcohol: the percent alcohol content of the wine

Output variables:

  • Quality: output variable (based on sensory data, score between 0 and 10)

Loading Dataset

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
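A minimal sketch of how a table like the one above could be loaded and previewed. The report does not show its loading code, so the file name in the comment is hypothetical; only the data frame name `wine_ds` is taken from later output. To stay self-contained, the example parses the first three printed rows from text instead of reading a file.

```r
# Hypothetical file name; the report does not say where the CSV lives.
# wine_ds <- read.csv("wineQualityReds.csv")

# Self-contained stand-in: the first three rows printed above, parsed from text.
csv_text <- "X,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
1,7.4,0.70,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
2,7.8,0.88,0.00,2.6,0.098,25,67,0.9968,3.20,0.68,9.8,5
3,7.8,0.76,0.04,2.3,0.092,15,54,0.9970,3.26,0.65,9.8,5"

wine_ds <- read.csv(text = csv_text)

# Preview the first rows, as in the output above.
head(wine_ds)
```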

Exploring Dataset

Let’s take a look at our dataset and get some important information about the variables, the structure and the schema of the data.

## [1] 1599   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
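The output above has the shape produced by base R's `dim()`, `names()`, `str()` and `summary()`. The sketch below runs those four calls on a small stand-in data frame (`wine_sample`, a hypothetical name, built from three columns of the first six rows), since the full data is not reproduced here.

```r
# Stand-in data: three columns from the first six rows shown earlier.
wine_sample <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  pH      = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L)
)

dim(wine_sample)      # number of rows and columns
names(wine_sample)    # column names
str(wine_sample)      # type and first values of each column
summary(wine_sample)  # min, quartiles, median, mean, max per column
```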

After exploration, we can make some observations:
1. free.sulfur.dioxide and total.sulfur.dioxide have a large range (possibly outliers).
2. There are more pH and alcohol values in the 3rd quartile.
3. pH has a high median and mean.
4. In general, residual.sugar is low. However, the maximum is very high (possibly an outlier).
5. Quality has no value greater than 8.
6. Quality is an ordinal categorical variable, so we can create a new variable that assigns labels according to the score.
7. According to the variable descriptions, fixed.acidity ~ volatile.acidity and free.sulfur.dioxide ~ total.sulfur.dioxide may be dependent.
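The "possibly outliers" remarks can be checked against the quartiles printed in the summary. The sketch below applies Tukey's 1.5 × IQR rule, which is an assumption here (the report does not name an outlier criterion), to the three variables flagged in the observations.

```r
# Quartiles and maxima copied from the summary() output above.
stats <- data.frame(
  variable = c("free.sulfur.dioxide", "total.sulfur.dioxide", "residual.sugar"),
  q1  = c(7, 22, 1.9),
  q3  = c(21, 62, 2.6),
  max = c(72, 289, 15.5)
)

# Tukey's rule (assumed criterion): values above Q3 + 1.5 * IQR are suspect.
stats$upper_fence <- stats$q3 + 1.5 * (stats$q3 - stats$q1)
stats$max_beyond_fence <- stats$max > stats$upper_fence

stats
```

Under this rule the maximum of each of the three variables lies above its upper fence, which is consistent with the outlier suspicion in observations 1 and 4.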

Common functions

This section contains some common functions that will be used during this analysis.
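The function definitions are not shown in the report. Later on it calls `bivariate_scatterplot(x, y, xlab, ylab)`, so a helper with that signature might look roughly like the sketch below. This is a hypothetical reconstruction in base graphics; the original is not shown and may well use ggplot2, as the warnings elsewhere in the report suggest.

```r
# Hypothetical stand-in for the report's helper; only the name and the
# argument order come from the call shown later in the report.
bivariate_scatterplot <- function(x, y, xlab, ylab) {
  # Semi-transparent points reduce overplotting.
  plot(x, y, xlab = xlab, ylab = ylab,
       pch = 16, col = rgb(0, 0, 0, alpha = 0.3))
  # Least-squares line to show the overall trend.
  abline(lm(y ~ x), col = "red")
  # Return the Pearson correlation invisibly for convenience.
  invisible(cor(x, y))
}

# Usage with toy vectors (not the wine data).
r <- bivariate_scatterplot(c(1, 2, 3, 4), c(2, 4, 5, 8), "x", "y")
```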

Univariate Plots Section

## Warning: Removed 132 rows containing non-finite values (stat_bin).

Answering some questions about univariate analysis

What is the structure of your dataset?

The Red Wine dataset has 1599 observations of 13 variables. All the variables are numerical except quality, which is categorical and rated between 0 (bad) and 10 (excellent).

  • Most of the wines have quality 5 or 6 on the scale of 0-10.
  • No wine has a quality lower than 3 or higher than 8.
  • Most of the wines have a pH between 3.2 and 3.4.
  • The average sugar amount is 2.54 g/dm^3 and the maximum is 15.5, well below the 45 g/L threshold for sweet wines, so none of the wine samples is sweet.

What is/are the main feature(s) of interest in your dataset?

We want to analyse the quality, so quality is the main feature of interest.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I believe acidity, alcohol, density and pH could affect the wine quality.

Did you create any new variables from existing variables in the dataset?

Yes, we changed quality to an ordered factor and created a new variable called “quality_label” that classifies wines as bad, good or very good based on quality.
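A minimal sketch of the two transformations described above. The cut points are hypothetical, since the report does not state which scores map to bad, good or very good. The data is a toy vector, not the real column.

```r
# Toy quality scores; the real column holds 1599 values between 3 and 8.
wine <- data.frame(quality = c(5, 5, 5, 6, 5, 7, 3, 8, 4, 6))

# Hypothetical cut points: 3-4 "bad", 5-6 "good", 7-8 "very good".
wine$quality_label <- cut(
  wine$quality,
  breaks = c(0, 4, 6, 10),
  labels = c("bad", "good", "very good"),
  ordered_result = TRUE
)

# Convert the score itself to an ordered factor.
wine$quality <- factor(wine$quality, ordered = TRUE)

# Cross-tabulate scores against labels to check the mapping.
table(wine$quality, wine$quality_label)

# as.numeric() on a factor returns level codes (1, 2, ...), not the scores.
as.numeric(wine$quality)
```

Because every score from 3 to 8 occurs in the data, the level codes are the scores shifted by a constant, so Pearson correlations computed on `as.numeric(quality)` are the same as those computed on the raw scores.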

Bivariate Plots Section

This correlation matrix shows how quality correlates with:

  • alcohol (positive correlation)
  • sulphates (positive correlation)
  • citric.acid (positive correlation)
  • volatile.acidity (negative correlation)
  • total.sulfur.dioxide (negative correlation)
  • density (negative correlation)

So, a ‘very good’ wine usually has:

Let’s investigate how the above chemical properties affect the quality of wine.

## 
##  Pearson's product-moment correlation
## 
## data:  wine_ds$alcohol and as.numeric(wine_ds$quality)
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663
## 
##  Pearson's product-moment correlation
## 
## data:  wine_ds$sulphates and as.numeric(wine_ds$quality)
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971
## 
##  Pearson's product-moment correlation
## 
## data:  wine_ds$citric.acid and as.numeric(wine_ds$quality)
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725

We see that alcohol, sulphates, and citric.acid are positively correlated with wine quality, but there are some outliers at the higher end of alcohol and sulphates among wines rated 5. This suggests that other factors may also determine the quality of the wine.
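To make the `cor.test()` output easier to read, the sketch below recomputes the t statistic and the 95 percent confidence interval for alcohol from the reported correlation and degrees of freedom alone. It uses the standard formulas for Pearson's test (t from r, and a Fisher z interval). The variable names are illustrative.

```r
r  <- 0.4761663   # reported correlation between alcohol and quality
df <- 1597        # reported degrees of freedom (n - 2)
n  <- df + 2

# t statistic for H0: true correlation equals 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via Fisher's z transformation.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 3)
round(ci, 4)
```

The results agree with the printed values (t = 21.639, interval 0.437 to 0.513) up to rounding.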

## 
##  Pearson's product-moment correlation
## 
## data:  wine_ds$volatile.acidity and as.numeric(wine_ds$quality)
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578
## 
##  Pearson's product-moment correlation
## 
## data:  wine_ds$total.sulfur.dioxide and as.numeric(wine_ds$quality)
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2320162 -0.1373252
## sample estimates:
##        cor 
## -0.1851003
## 
##  Pearson's product-moment correlation
## 
## data:  wine_ds$density and as.numeric(wine_ds$quality)
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2220365 -0.1269870
## sample estimates:
##        cor 
## -0.1749192

We see that volatile.acidity, total.sulfur.dioxide and density are negatively correlated with wine quality, but there are some outliers, which suggests that other factors also determine the quality of the wine.

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

  • The correlation plot helped to understand the correlation among different features. Quality is most strongly correlated positively with alcohol (0.48) and sulphates (0.25), and negatively with volatile acidity (-0.39). Good wines tend to have slightly lower pH values (the correlation with quality is weak, about -0.06), which also goes with having more fixed and citric acid.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

  • Citric acid and fixed acidity have a strong positive correlation of 0.67
bivariate_scatterplot(wine_ds$citric.acid, wine_ds$fixed.acidity, 'citric.acid', 'fixed.acidity')

cor.test(wine_ds$citric.acid, as.numeric(wine_ds$fixed.acidity))
## 
##  Pearson's product-moment correlation
## 
## data:  wine_ds$citric.acid and as.numeric(wine_ds$fixed.acidity)
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6438839 0.6977493
## sample estimates:
##       cor 
## 0.6717034

What was the strongest relationship you found?

##        fixed.acidity     volatile.acidity          citric.acid 
##           0.12405165          -0.39055778           0.22637251 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##           0.01373164          -0.12890656          -0.05065606 
## total.sulfur.dioxide              density                   pH 
##          -0.18510029          -0.17491923          -0.05773139 
##            sulphates              alcohol 
##           0.25139708           0.47616632
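The strongest relationship with quality can be read off by ranking the coefficients above by absolute value. The sketch below does this on the printed numbers. The vector name `cors` is illustrative.

```r
# Correlations with quality, copied from the output above.
cors <- c(
  fixed.acidity        =  0.12405165,
  volatile.acidity     = -0.39055778,
  citric.acid          =  0.22637251,
  residual.sugar       =  0.01373164,
  chlorides            = -0.12890656,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.18510029,
  density              = -0.17491923,
  pH                   = -0.05773139,
  sulphates            =  0.25139708,
  alcohol              =  0.47616632
)

# Rank by strength regardless of sign.
ranked <- cors[order(abs(cors), decreasing = TRUE)]
head(ranked, 3)
```

Alcohol ranks first, followed by volatile.acidity and sulphates. Among all pairs of variables, the citric.acid and fixed.acidity correlation of 0.67 reported earlier is stronger than any correlation with quality.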

Multivariate Plots Section

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

  • I observed that higher amounts of sulphates and citric.acid, combined with lower volatile acidity, make a good wine. Lower density and higher alcohol content also make a good wine.

Were there any interesting or surprising interactions between features?

  • A well-balanced combination of pH and fixed.acidity makes a good wine.

Final Plots and Summary

Plot One: Quality of wine

##   3   4   5   6   7   8 
##  10  53 681 638 199  18
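The counts above can be turned into shares to quantify how concentrated the ratings are. A short sketch using the printed counts (the vector name `counts` is illustrative):

```r
# Number of wines per quality score, copied from the output above.
counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

total  <- sum(counts)        # should match the 1599 observations
shares <- counts / total     # proportion of wines at each score

# Combined share of wines rated 5 or 6.
share_5_6 <- sum(shares[c("5", "6")])

round(shares, 3)
round(share_5_6, 3)
```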

Description

This graph shows that most wines are rated 5 or 6 (1319 of 1599, about 82%).

Plot Two: Effect of Alcohol

Description

  • The plot above suggests that the quality of red wine increases with alcohol content. Alcohol has the strongest correlation with quality (0.48): as the alcohol content increases, the quality of the wine typically does as well.

Plot Three: Alcohol and volatile acidity

## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.

Description

  • A negative relationship between alcohol and residual sugar is detected. Although the variance is quite high, the smoothing curve shows the average residual sugar by alcohol. It is interesting to see that residual.sugar decreases significantly as alcohol increases.

Reflection